Trending words

Let's analyze which words appear most often on a given day relative to a baseline. We can also build an n-grams-style line chart of which words appear most often over time (a sketch of such a chart follows the per-day term counts below).


In [2]:
from jupyter_cms.loader import load_notebook

eda = load_notebook('./data_exploration.ipynb')

df, newspapers = eda.load_data()

In [6]:
import pandas as pd

pd.set_option('display.max_columns', 100)

In [7]:
df.head(3)


Out[7]:
text fontface fontsize bbox_left bbox_bottom bbox_right bbox_top bbox_area avg_character_area percent_of_page page page_width page_height page_area date day_of_week weekend slug id page_height_round page_width_round page_width_round_10 page_height_round_10 aspect_ratio
9 A blast on a subway killed 11 \npeople Monday.... FrugalSans-Light-DTC 12.177 55.062 230.002 166.022 268.579 4280.50392 50.108339 0.003269 1 801.0 1634.976 1309615.776 2017-04-04 1 False VA_DNR 40876 1634 801 800 1630 0.5
11 Donna R. Adkins, 68, \nBridgewater \nCalvin E.... CKLDCP+FranklinGothic-Demi 12.570 368.620 249.330 489.200 322.180 8784.25300 56.935536 0.006708 1 801.0 1634.976 1309615.776 2017-04-04 1 False VA_DNR 40878 1634 801 800 1630 0.5
12 Nancy M. Lawrence, 86, Elkton \nOlive V. Roop,... CKLIDL+FranklinGothic-Book 12.850 509.320 249.330 639.739 322.180 9501.02415 56.830592 0.007255 1 801.0 1634.976 1309615.776 2017-04-04 1 False VA_DNR 40879 1634 801 800 1630 0.5

In [10]:
print('''Rows: {}
Dates: {} ({} - {})
'''.format(
    df.shape[0],
    df.date.nunique(),
    df.date.min(),
    df.date.max()
))


Rows: 796938
Dates: 105 (2017-04-01 00:00:00 - 2017-07-23 00:00:00)

In [14]:
import spacy

nlp = spacy.load('en')

In [41]:
docs = []
for i, doc in enumerate(nlp.pipe(df.text, batch_size=10000, n_threads=7)):
    if i % 5000 == 0:
        print('.', end='')
    docs.append(doc)


................................................................................................................................................................

In [268]:
def remove_token(t):
    return not t.is_alpha or t.is_stop

In [270]:
lemmas = []

for d in docs:
    d_lemmas = []
    for t in d:
        if not remove_token(t):
            d_lemmas.append(t.lemma_)
    
    lemmas.append(d_lemmas)

In [271]:
import itertools
df['lemmas'] = lemmas

In [346]:
newspaper_text = df.groupby(['date']).lemmas.apply(lambda x: list(itertools.chain(*x)))
newspapers_per_day = df.groupby(['date']).slug.nunique()

In [357]:
import sys
from collections import Counter

newspaper_tfs = []

# tf - number of times a word shows up in the current document
# doc_freqs - number of documents that contain a given word (not computed in this loop)

for i, d in enumerate(newspaper_text):
    if i % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()
    tf = Counter(d)
    newspaper_tfs.append(tf)


.

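As an aside, these per-day counters are already enough for the n-grams-style line chart mentioned at the top. The cell below is an illustrative sketch (not run here): it assumes newspaper_tfs and newspaper_text as built above, and trend_words is a hypothetical hand-picked list that would need to match lemmas actually present in the corpus.


In [ ]:
import matplotlib.pyplot as plt

# Hypothetical words to track; swap in lemmas that actually occur in the data.
trend_words = ['health', 'police', 'budget']
dates = newspaper_text.index  # one entry per scraped day

fig, ax = plt.subplots(figsize=(12, 4))
for word in trend_words:
    # A Counter returns 0 for missing keys, so days without the word plot as zero.
    counts = [tf[word] for tf in newspaper_tfs]
    ax.plot(dates, counts, label=word)

ax.set_xlabel('Date')
ax.set_ylabel('Occurrences across front pages')
ax.legend()
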
To detect anomalously high usage of a word on a day, we want several things:

  • High usage of the word relative to a baseline (every other day)
  • High usage of the word in general. A word that appears only once isn't very meaningful, even if that single use makes it look infinitely more common in one particular document.
  • To avoid the situation where a single document overwhelms the entire corpus through frequent usage of a term.

We also want a multi-level model that will estimate two things in tandem: the presence of clusters of topics, as well as anomalous vocabulary usage within each topic.

As a starting point, we can already compare two numbers: how frequently a word appears in a given day's text vs. how frequently it appears in the other documents that mention it.
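
Here is a minimal sketch of that comparison (illustrative only, and not the chi-squared approach used below): for one day, score each word by its smoothed rate of use on that day relative to its rate on every other day, skipping globally rare words so that a single mention can't look infinitely more common. The min_total and smoothing parameters are hypothetical knobs, and this sketch does not yet handle the third point above (a single document dominating the corpus).


In [ ]:
from collections import Counter

def daily_ratios(day_ix, tfs, min_total=25, smoothing=1.0):
    """Score words for one day by smoothed usage relative to all other days."""
    day_tf = tfs[day_ix]
    rest_tf = Counter()
    for i, tf in enumerate(tfs):
        if i != day_ix:
            rest_tf.update(tf)

    day_total = sum(day_tf.values())
    rest_total = sum(rest_tf.values())

    scores = {}
    for word, count in day_tf.items():
        if count + rest_tf[word] < min_total:
            # Skip globally rare words: a single use shouldn't look "infinitely" trendy.
            continue
        day_rate = (count + smoothing) / (day_total + smoothing)
        rest_rate = (rest_tf[word] + smoothing) / (rest_total + smoothing)
        scores[word] = day_rate / rest_rate

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Top 10 candidate trending words for the first scraped day.
daily_ratios(0, newspaper_tfs)[:10]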


In [398]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

In [423]:
fig = plt.figure(figsize=(12, 5))
plot1 = fig.add_subplot(131)
plot2 = fig.add_subplot(132)
plot3 = fig.add_subplot(133)

plot1.plot(range(len(newspaper_text)), [len(text) for text in newspaper_text])
plot1.set_xlabel("Day of scrape")
plot1.set_ylabel("Words")

plot2.plot(range(len(newspaper_tfs)), [len(tf) for tf in newspaper_tfs])
plot2.set_xlabel("Day of scrape")
plot2.set_ylabel("Unique words")

plot3.plot(range(len(newspapers_per_day)), newspapers_per_day.values)
plot3.set_xlabel("Day of scrape")
plot3.set_ylabel("Number of newspapers")
plt.tight_layout()



In [438]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()
X = dv.fit_transform(newspaper_tfs)

In [468]:
import numpy as np
from sklearn.feature_selection import chi2

def get_labels_for_day(day, N):
    arr = np.zeros(N)
    arr[day] = 1
    
    return arr

get_labels_for_day(2, 5)


Out[468]:
array([ 0.,  0.,  1.,  0.,  0.])

In [ ]:
N = len(newspaper_tfs)
words = np.array(dv.get_feature_names())

In [502]:
top_words_by_day = []

for i in range(N):
    print('.', end='')
    sys.stdout.flush()
    
    keyness, _ = chi2(X, get_labels_for_day(i, N))
    ranking = np.argsort(keyness)[::-1]
    top_words = words[ranking]
    top_words_by_day.append(list(zip(top_words, keyness[ranking])))


.........................................................................................................

In [506]:
sum([sys.getsizeof(x) for x in top_words_by_day])


Out[506]:
148002120

In [536]:
for date, top_words in zip(newspaper_text.index, top_words_by_day):
    print('.', end='')
    sys.stdout.flush()
    
    date_str = pd.to_datetime(str(date)).strftime('%Y-%m-%d')
    
    with open('results/top-words/{}.csv'.format(date_str), 'w') as out:
        out.write('\n'.join([','.join([line[0], str(np.round(line[1], 2))]) for line in top_words]))


.........................................................................................................

In [539]:
newspaper_day_text = df.groupby(['date', 'slug']).lemmas.apply(lambda x: list(itertools.chain(*x)))

In [549]:
newspaper_day_meta = df.groupby(['date', 'slug']).first().reset_index()[['date', 'slug']]

In [654]:
newspaper_day_tf = []

for lemmas in newspaper_day_text:
    newspaper_day_tf.append(Counter([lemma for lemma in lemmas if len(lemma) > 2]))

In [655]:
dv = DictVectorizer()
X = dv.fit_transform(newspaper_day_tf)

In [656]:
newspaper_day_tf = np.array(newspaper_day_tf)

In [620]:
def get_day(day):
    date = newspaper_day_meta.date.unique()[day]
    return newspaper_day_meta[newspaper_day_meta.date == date].index

In [608]:
def get_slug_in_day(slug, day):
    date = newspaper_day_meta.date.unique()[day]
    ndf = newspaper_day_meta[newspaper_day_meta.date == date].reset_index()
    return ndf[ndf.slug == slug].index[0]

In [615]:
def get_day_slugs(day):
    date = newspaper_day_meta.date.unique()[day]
    return newspaper_day_meta[newspaper_day_meta.date == date].slug.values

In [ ]:
top_words_by_slug_day = []
words = np.array(dv.get_feature_names())

for i in range(N):
    print('.', end='')
    sys.stdout.flush()
    
    day_ix = get_day(i)
    X_universe = X[day_ix, :]
    
    day_slugs = get_day_slugs(i)
    M = len(day_slugs)
    for slug in day_slugs:
        j = get_slug_in_day(slug, i)
        
        keyness, _ = chi2(X_universe, get_labels_for_day(j, M))
        ranking = np.argsort(np.nan_to_num(keyness))[::-1]
        top_words = words[ranking[:100]]
        top_words_by_slug_day.append(top_words)


...
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-657-7fd924890306> in <module>()
     14         j = get_slug_in_day(slug, i)
     15 
---> 16         keyness, _ = chi2(X_universe, get_labels_for_day(j, M))
     17         ranking = np.argsort(np.nan_to_num(keyness))[::-1]
     18         top_words = words[ranking[:100]]

/home/sam/workspace/frontpages/analysis/venv/lib/python3.5/site-packages/sklearn/feature_selection/univariate_selection.py in chi2(X, y)
    225     expected = np.dot(class_prob.T, feature_count)
    226 
--> 227     return _chisquare(observed, expected)
    228 
    229 

/home/sam/workspace/frontpages/analysis/venv/lib/python3.5/site-packages/sklearn/feature_selection/univariate_selection.py in _chisquare(f_obs, f_exp)
    166         chisq /= f_exp
    167     chisq = chisq.sum(axis=0)
--> 168     return chisq, special.chdtrc(k - 1, chisq)
    169 
    170 

KeyboardInterrupt: 

In [ ]:
len(top_words_by_slug_day)
